HDFS Asset URI: allow empty netloc for Hadoop fs.defaultFS #68022

Draft
stegololz wants to merge 2 commits into apache:main from stegololz:fix/hdfs-asset-uri-allow-default-fs

Conversation

@stegololz

Contributor

Summary

Relax airflow.providers.apache.hdfs.assets.hdfs.sanitize_uri to accept the canonical hdfs:///path form (empty netloc). Previously, this form was rejected with ValueError: URI format hdfs:// must contain a namenode host.

Why

  • RFC 3986: the authority component of a URI is optional. hdfs:///path is well-formed.
  • Hadoop semantics: an empty authority means "resolve via fs.defaultFS from core-site.xml". This is the standard idiom for portable Spark/Hive/MapReduce jobs that must not hard-code a namenode — same shape as file:///etc/hosts.
  • The strict check was introduced in feat: Add uri sanitizers and asset factories for new schemes #66426 (alongside other new-scheme sanitizers). It is more restrictive than the Hadoop convention and breaks any DAG using Asset("hdfs:///apps/x/file.parquet") at parse time.

Change

  • providers/apache/hdfs/.../assets/hdfs.py: drop the "must contain a namenode host" check; keep the path-required check.
  • providers/apache/hdfs/.../tests/.../test_hdfs.py:
    • Add positive cases for hdfs:///apps/myapp/... (empty netloc) — pass.
    • Add negative case hdfs://namenode:8020 (no path) — fail.
    • Add test_convert_asset_to_openlineage_default_fs covering OpenLineage emission with empty netloc.
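
A minimal sketch of the relaxed check described in the list above, assuming the sanitizer receives and returns a urllib.parse.SplitResult (as the discussion further down indicates). The error message and the is_accepted helper are illustrative assumptions, not the provider's actual code:

```python
from urllib.parse import SplitResult, urlsplit


def sanitize_uri(uri: SplitResult) -> SplitResult:
    """Sketch only: accept an empty netloc, but still require a path."""
    # Illustrative message; the provider's real wording may differ.
    if not uri.path:
        raise ValueError("URI format hdfs:// must contain a path")
    return uri


def is_accepted(raw: str) -> bool:
    """Hypothetical helper: True if the sketch sanitizer accepts the URI."""
    try:
        sanitize_uri(urlsplit(raw))
    except ValueError:
        return False
    return True
```

Under this sketch, hdfs:///apps/myapp/file.parquet and hdfs://namenode:8020/apps/x are accepted, while hdfs://namenode:8020 (no path) is rejected.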

convert_asset_to_openlineage already tolerates an empty netloc (f"hdfs://{parsed.netloc}" yields an hdfs:// namespace), so there is no functional change there.
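
To illustrate the namespace claim, here is a small sketch that applies the same f-string to a parsed URI. The openlineage_namespace helper is hypothetical and stands in for the relevant part of convert_asset_to_openlineage; it does not use the OpenLineage client:

```python
from urllib.parse import urlsplit


def openlineage_namespace(uri: str) -> str:
    """Hypothetical helper mirroring the f-string quoted above."""
    parsed = urlsplit(uri)
    # An empty netloc leaves just the scheme and the "//" separator.
    return f"hdfs://{parsed.netloc}"
```

With an explicit namenode, the namespace keeps the host and port; with an empty authority, it collapses to the bare hdfs:// prefix.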

Related

Gen-AI disclosure

This PR was prepared with Gen-AI assistance (Claude). I reviewed all generated code.

dabla
dabla previously approved these changes Jun 4, 2026
@stegololz stegololz force-pushed the fix/hdfs-asset-uri-allow-default-fs branch from 19e5a39 to 4536410 Compare June 4, 2026 16:21
@stegololz

Contributor Author

I fixed the tests, but I'm not happy with the current implementation.

One caveat you should know about: when the input is hdfs:///path, the asset is now stored as hdfs:/path. This is a consequence of how urllib.parse represents and serializes URIs, not a deliberate design choice on my part:

>>> from urllib.parse import urlsplit, urlunsplit
>>> urlsplit("hdfs:///apps/x") == urlsplit("hdfs:/apps/x")
True
>>> urlunsplit(urlsplit("hdfs:///apps/x"))
'hdfs:/apps/x'

urlsplit cannot distinguish the two forms: they parse to identical SplitResult instances with an empty netloc. urlunsplit, in turn, does not emit // for an empty authority on this scheme; it omits the // entirely. Since the provider hook only sees and returns a SplitResult, and the surrounding _sanitize_uri in task-sdk calls urlunsplit on the result, the provider on its own cannot preserve the /// form in the stored URI.

Round-trip identity is still stable: both hdfs:///apps/x and hdfs:/apps/x normalize to hdfs:/apps/x, so asset matching remains consistent across re-parses. For the OpenLineage conversion, an empty netloc produces a hdfs:// namespace, which is what consumers expect for fs.defaultFS-resolved paths.

If preserving the literal hdfs:///path form in storage is considered worth the additional surface area, it could be done with a small opt-in in _sanitize_uri (for example, a normalizer.preserve_empty_authority = True attribute set when we want the empty-authority form retained, and a corresponding string-level fix-up after urlunsplit). That keeps the behavior change scoped to opt-in normalizers (only HDFS would set it today), and leaves others untouched.
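
A rough sketch of what such a string-level fix-up could look like. The function name unsplit_with_opt_in and the preserve_empty_authority parameter are hypothetical, standing in for the attribute proposed above; this is not existing task-sdk code:

```python
from urllib.parse import SplitResult, urlsplit, urlunsplit


def unsplit_with_opt_in(parsed: SplitResult, preserve_empty_authority: bool = False) -> str:
    """Sketch: serialize a SplitResult, optionally keeping 'scheme:///path'."""
    uri = urlunsplit(parsed)
    prefix = f"{parsed.scheme}:"
    needs_fixup = (
        preserve_empty_authority
        and parsed.scheme
        and not parsed.netloc
        and parsed.path.startswith("/")
        and not uri.startswith(prefix + "//")
    )
    if needs_fixup:
        # Re-insert the empty authority that urlunsplit dropped.
        uri = prefix + "//" + uri[len(prefix):]
    return uri
```

Normalizers that do not opt in keep today's output, and URIs with an explicit namenode are unaffected either way because their netloc is non-empty.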

I deliberately left this out of the current PR to keep the change minimal and provider-local. Happy to follow up with that task-sdk change if someone thinks it is worthwhile.

@stegololz stegololz marked this pull request as draft June 4, 2026 16:34
@dabla dabla dismissed their stale review June 4, 2026 16:38

More changes are needed

The hdfs asset URI sanitizer rejected hdfs:///path as missing a
namenode host. Per RFC 3986 the authority component is optional;
per Hadoop semantics an empty authority means 'resolve via
fs.defaultFS from core-site.xml' — i.e. hdfs:///apps/x is the
canonical form for jobs that must not hard-code a namenode.

Relax sanitize_uri to require only a non-empty path, and add
positive + negative parametrized tests covering the default-fs
form and the corresponding OpenLineage conversion.
@stegololz stegololz force-pushed the fix/hdfs-asset-uri-allow-default-fs branch from 4536410 to 5f6d16c Compare June 13, 2026 09:46
@stegololz

Contributor Author

Update after digging into normalization.

Dropping the host check alone is not enough: the SDK normalizes asset URIs with urllib.parse.urlunsplit, which only keeps the // for an empty authority when the scheme is listed in uses_netloc. hdfs is not in that list, so hdfs:///apps/x was being silently rewritten to hdfs:/apps/x. Since asset URIs are primary keys and feed OpenLineage, that silent rewrite is not acceptable.
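
The contrast with file can be seen directly with the standard library. This is a small illustrative sketch; round_trip is a hypothetical helper, and uses_netloc is an undocumented module attribute of urllib.parse:

```python
from urllib.parse import urlsplit, urlunsplit, uses_netloc


def round_trip(uri: str) -> str:
    """Parse and re-serialize a URI the way a normalizer pipeline would."""
    return urlunsplit(urlsplit(uri))


# "file" is listed in uses_netloc, so its empty authority survives.
file_result = round_trip("file:///etc/hosts")
# "hdfs" is not listed by default, so the "//" is dropped.
hdfs_result = round_trip("hdfs:///apps/x")
```

Here file_result stays file:///etc/hosts, while hdfs_result becomes hdfs:/apps/x.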

Current branch:

Register hdfs in uses_netloc so that hdfs:///path round-trips intact, as file already does.
hdfs://namenode:8020/path is unaffected.

With the current behaviour, hdfs:/path canonicalizes to hdfs:///path. Is this acceptable, given that Hadoop treats the two differently at the URI level (hdfs:/path and hdfs:///path could be distinct assets)?
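
To make the trade-off concrete, the sketch below registers hdfs only temporarily and shows that both spellings collapse to the same string. The netloc_scheme context manager and canonical_forms function are hypothetical helpers written for this illustration; the branch itself appends to uses_netloc at import time:

```python
import contextlib
import urllib.parse
from urllib.parse import urlsplit, urlunsplit


@contextlib.contextmanager
def netloc_scheme(scheme: str):
    """Temporarily add a scheme to the undocumented uses_netloc list."""
    added = scheme not in urllib.parse.uses_netloc
    if added:
        urllib.parse.uses_netloc.append(scheme)
    try:
        yield
    finally:
        # Only undo what this context manager added itself.
        if added:
            urllib.parse.uses_netloc.remove(scheme)


def canonical_forms() -> set[str]:
    """Serialize both spellings while hdfs is registered."""
    with netloc_scheme("hdfs"):
        return {urlunsplit(urlsplit(u)) for u in ("hdfs:/apps/x", "hdfs:///apps/x")}
```

Both inputs come back as hdfs:///apps/x, so with this approach the two spellings cannot be stored as distinct assets.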

Comment on lines +30 to +32
# Preserve the empty-authority "hdfs:///path" (fs.defaultFS) form through urlunsplit, like "file".
if "hdfs" not in urllib.parse.uses_netloc:
    urllib.parse.uses_netloc.append("hdfs")

Member


I don’t think it’s a good idea to do this; uses_netloc is undocumented and should be considered private.

When is this needed?

@stegololz stegololz Jun 13, 2026

Contributor Author


I can drop the line or implement similar behaviour somewhere else; it is not mandatory.

The SDK normalizes asset URIs through urlunsplit here:

if (normalizer := _get_uri_normalizer(normalized_scheme)) is not None:
    parsed = normalizer(parsed)
return urllib.parse.urlunsplit(parsed)

That makes hdfs:///apps/x normalize to hdfs:/apps/x. I'd like to keep the canonical hdfs:/// form, which is why the if is there.

Caveat: as long as I stay on the provider, I won't be able to make a distinction between hdfs:/path and hdfs:///path (both end up as hdfs:///path), because the normalizer only sees the parsed result where both already have an empty netloc. To keep them distinct I'd have to also modify the Task SDK.

What would be the preference here?
